Beta Distribution — Modeling Uncertain Probabilities#

The beta distribution is the workhorse distribution for random variables constrained to \([0,1]\). It is especially important for modeling uncertain probabilities (e.g., conversion rates, success probabilities, fractions).

What you’ll learn#

  • what the beta distribution models and why it’s useful

  • the PDF/CDF and how it relates to the beta function and gamma function

  • closed-form moments (mean/variance/skewness/kurtosis), MGF/CF, and entropy

  • how \((\alpha,\beta)\) control shape, mean, and concentration

  • a NumPy-only sampler (via Gamma sampling) + Monte Carlo validation

  • practical usage via scipy.stats.beta (pdf, cdf, rvs, fit)

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from scipy import stats, special

# Plotly rendering (CKC convention)
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

# Reproducibility
rng = np.random.default_rng(7)

np.set_printoptions(precision=4, suppress=True)

1) Title & Classification#

  • Name: Beta distribution

  • Type: Continuous

  • Support: \(x \in [0, 1]\) (density is defined for \(0 < x < 1\))

  • Parameters: shape parameters \(\alpha > 0\), \(\beta > 0\)

We write:

\[X \sim \mathrm{Beta}(\alpha, \beta).\]

2) Intuition & Motivation#

2.1 What it models#

The beta distribution models a random proportion or random probability. If you know a quantity must live in \([0,1]\) (a fraction, rate, probability, mixture weight), the beta distribution is often the first candidate.

2.2 Real-world use cases#

  • Conversion rate / click-through rate: unknown probability of success

  • A/B testing: uncertainty over two proportions

  • Reliability / defect rates: fraction of items that fail

  • Normalized quantities: e.g., fraction of budget, fraction of time spent

  • Random mixing weights: blend two components with a random weight \(w \in [0,1]\)

2.3 Relations to other distributions#

  • Conjugate prior for Bernoulli/Binomial likelihoods (Beta–Bernoulli / Beta–Binomial).

  • Special cases:

    • \(\mathrm{Beta}(1,1)\) is Uniform on \([0,1]\).

    • \(\mathrm{Beta}(\tfrac{1}{2}, \tfrac{1}{2})\) is the arcsine distribution.

  • Order statistics: if \(U_{(k)}\) is the \(k\)-th order statistic of \(n\) iid Uniform\((0,1)\) samples, then \(U_{(k)} \sim \mathrm{Beta}(k,\; n+1-k)\); see the quick check after this list.

  • Gamma connection: if \(G_1 \sim \mathrm{Gamma}(\alpha,1)\) and \(G_2 \sim \mathrm{Gamma}(\beta,1)\) independently, then \(\frac{G_1}{G_1 + G_2} \sim \mathrm{Beta}(\alpha,\beta)\).

  • Dirichlet: Beta is the 2D special case of a Dirichlet distribution.
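
As a quick sanity check of the order-statistic relation, we can sort uniform draws and test the \(k\)-th smallest against the corresponding beta CDF. A minimal sketch (n_ord and k_ord are arbitrary illustrative choices):

# Sketch: the k-th order statistic of n iid Uniform(0,1) draws should be Beta(k, n+1-k)
n_ord, k_ord = 10, 3
u_sorted = np.sort(rng.random((50_000, n_ord)), axis=1)
kth_smallest = u_sorted[:, k_ord - 1]  # k-th smallest value in each row
stats.kstest(kth_smallest, stats.beta(k_ord, n_ord + 1 - k_ord).cdf)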

3) Formal Definition#

3.1 Beta function#

The beta function is

\[B(\alpha,\beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}.\]

3.2 PDF#

For \(0 < x < 1\):

\[f(x\mid\alpha,\beta) = \frac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}.\]

3.3 CDF#

The CDF can be written using the regularized incomplete beta function \(I_x(\alpha,\beta)\):

\[F(x\mid\alpha,\beta) = I_x(\alpha,\beta) = \frac{\int_0^x t^{\alpha-1}(1-t)^{\beta-1}\,dt}{B(\alpha,\beta)}.\]

In SciPy, this is available as scipy.special.betainc(a, b, x), which we use below.

def beta_pdf(x: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    '''Numerically stable beta PDF via log-space (uses SciPy special functions).'''
    x = np.asarray(x, dtype=float)
    log_pdf = (
        (alpha - 1) * np.log(x)
        + (beta - 1) * np.log1p(-x)
        - special.betaln(alpha, beta)
    )
    return np.exp(log_pdf)


def beta_cdf(x: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    '''Beta CDF via regularized incomplete beta I_x(alpha, beta).'''
    x = np.asarray(x, dtype=float)
    return special.betainc(alpha, beta, x)


# Quick sanity check: PDF integrates to ~1
alpha0, beta0 = 2.0, 5.0
xgrid = np.linspace(1e-6, 1 - 1e-6, 200_000)
area = np.trapz(beta_pdf(xgrid, alpha0, beta0), xgrid)  # renamed np.trapezoid in NumPy 2.0
area
0.9999999999225003

4) Moments & Properties#

4.1 Mean, variance, skewness, kurtosis#

For \(X \sim \mathrm{Beta}(\alpha,\beta)\):

  • Mean: \(\mathbb{E}[X] = \dfrac{\alpha}{\alpha+\beta}\).

  • Variance: \(\mathrm{Var}(X) = \dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\).

  • Skewness: \(\gamma_1 = \dfrac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}}\).

  • Excess kurtosis (kurtosis minus 3): \(\gamma_2 = \dfrac{6\big[(\alpha-\beta)^2(\alpha+\beta+1) - \alpha\beta(\alpha+\beta+2)\big]}{\alpha\beta(\alpha+\beta+2)(\alpha+\beta+3)}\).

A useful compact identity for raw moments is:

\[\mathbb{E}[X^k] = \frac{(\alpha)_{k}}{(\alpha+\beta)_{k}},\]

where \((a)_k = a(a+1)\cdots(a+k-1)\) is the rising factorial (Pochhammer symbol).
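
This identity is easy to sanity-check with scipy.special.poch (the Pochhammer symbol) against SciPy's non-central moments; a small sketch reusing \((\alpha_0,\beta_0)\) from above:

# Sketch: raw moment E[X^k] = (α)_k / (α+β)_k via the Pochhammer symbol
k = 3
special.poch(alpha0, k) / special.poch(alpha0 + beta0, k), stats.beta(alpha0, beta0).moment(k)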

4.2 Mode#

If \(\alpha > 1\) and \(\beta > 1\) (unimodal interior case),

\[\mathrm{mode} = \frac{\alpha-1}{\alpha+\beta-2}.\]

4.3 MGF and characteristic function#

Because the support is bounded, the MGF exists for all real \(t\):

\[M_X(t) = \mathbb{E}[e^{tX}] = {}_1F_1(\alpha;\,\alpha+\beta;\,t),\]

where \({}_1F_1\) is the confluent hypergeometric function. The characteristic function is

\[\varphi_X(t) = M_X(it) = {}_1F_1(\alpha;\,\alpha+\beta;\,it).\]

4.4 Differential entropy#

The differential entropy is

\[h(X) = \ln B(\alpha,\beta) - (\alpha-1)\psi(\alpha) - (\beta-1)\psi(\beta) + (\alpha+\beta-2)\psi(\alpha+\beta),\]

where \(\psi\) is the digamma function.

def beta_moments(alpha: float, beta: float) -> dict:
    a, b = float(alpha), float(beta)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    skew = 2 * (b - a) * np.sqrt(a + b + 1) / ((a + b + 2) * np.sqrt(a * b))
    excess_kurt = (
        6
        * (((a - b) ** 2) * (a + b + 1) - a * b * (a + b + 2))
        / (a * b * (a + b + 2) * (a + b + 3))
    )
    mode = np.nan
    if a > 1 and b > 1:
        mode = (a - 1) / (a + b - 2)

    mgf = lambda t: special.hyp1f1(a, a + b, t)

    entropy = (
        special.betaln(a, b)
        - (a - 1) * special.digamma(a)
        - (b - 1) * special.digamma(b)
        + (a + b - 2) * special.digamma(a + b)
    )

    return {
        "mean": mean,
        "var": var,
        "skew": skew,
        "excess_kurtosis": excess_kurt,
        "mode": mode,
        "entropy": entropy,
        "mgf": mgf,
    }


m = beta_moments(alpha0, beta0)
{k: v for k, v in m.items() if k != "mgf"}
{'mean': 0.2857142857142857,
 'var': 0.025510204081632654,
 'skew': 0.5962847939999439,
 'excess_kurtosis': -0.12,
 'mode': 0.2,
 'entropy': -0.48453071499548805}
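
As a cross-check, SciPy exposes the same differential entropy directly on the frozen distribution:

# Sanity check: should match m["entropy"] above
float(stats.beta(alpha0, beta0).entropy())
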
# Monte Carlo check of mean/variance + MGF at a few t
n = 200_000
samples_scipy = stats.beta(alpha0, beta0).rvs(size=n, random_state=rng)

mc_mean = samples_scipy.mean()
mc_var = samples_scipy.var(ddof=0)

mc_mgf_1 = np.mean(np.exp(1.0 * samples_scipy))
mc_mgf_m1 = np.mean(np.exp(-1.0 * samples_scipy))

(
    m["mean"],
    mc_mean,
    m["var"],
    mc_var,
    m["mgf"](1.0),
    mc_mgf_1,
    m["mgf"](-1.0),
    mc_mgf_m1,
)
(0.2857142857142857,
 0.2850010733170977,
 0.025510204081632654,
 0.02540504391457564,
 1.3483340379497217,
 1.3473002240829173,
 0.7608141393691706,
 0.7613179409507954)

5) Parameter Interpretation#

The parameters \((\alpha,\beta)\) control both where the mass is (the mean) and how concentrated it is.

5.1 Mean–concentration reparameterization#

A common and very interpretable reparameterization uses:

  • mean \(m \in (0,1)\)

  • concentration \(\kappa > 0\)

with

\[\alpha = \kappa m, \qquad \beta = \kappa(1-m).\]

  • Increasing \(\kappa\) (holding \(m\) fixed) makes the distribution tighter around \(m\).

  • Moving \(m\) (holding \(\kappa\) fixed) shifts the mass left/right.
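
A minimal helper sketch for converting between the two parameterizations (the names mk_to_ab / ab_to_mk are illustrative, not a SciPy API):

def mk_to_ab(m: float, kappa: float) -> tuple[float, float]:
    '''(mean m, concentration kappa) -> (alpha, beta) with alpha = kappa*m, beta = kappa*(1-m).'''
    return kappa * m, kappa * (1 - m)


def ab_to_mk(alpha: float, beta: float) -> tuple[float, float]:
    '''(alpha, beta) -> (mean m, concentration kappa) with m = alpha/(alpha+beta), kappa = alpha+beta.'''
    return alpha / (alpha + beta), alpha + beta


mk_to_ab(0.3, 20)  # (6.0, 14.0): Beta(6, 14) has mean 0.3 and concentration 20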

5.2 Shape regimes#

  • \(\alpha = \beta = 1\): uniform

  • \(\alpha, \beta > 1\): unimodal with an interior mode; the density vanishes at both endpoints

  • \(\alpha < 1\) or \(\beta < 1\): density diverges near 0 and/or 1 (still integrable)

  • \(\alpha = \beta\): symmetric around 0.5

x = np.linspace(1e-4, 1 - 1e-4, 600)

param_sets = [
    (0.5, 0.5, "U-shaped (0.5,0.5)"),
    (1.0, 1.0, "Uniform (1,1)"),
    (2.0, 2.0, "Symmetric peak (2,2)"),
    (2.0, 5.0, "Skewed right (2,5)"),
    (5.0, 2.0, "Skewed left (5,2)"),
    (8.0, 1.5, "Mass near 1 (8,1.5)"),
]

fig = go.Figure()
for a, b, label in param_sets:
    fig.add_trace(go.Scatter(x=x, y=beta_pdf(x, a, b), mode="lines", name=label))

fig.update_layout(
    title="Beta PDF for different (α, β)",
    xaxis_title="x",
    yaxis_title="density",
    width=900,
    height=450,
)
fig
# Same mean, different concentration κ
m_fixed = 0.3
kappas = [2, 5, 20, 100]

fig = go.Figure()
for kappa in kappas:
    a = kappa * m_fixed
    b = kappa * (1 - m_fixed)
    fig.add_trace(go.Scatter(x=x, y=beta_pdf(x, a, b), mode="lines", name=f"κ={kappa}"))

fig.update_layout(
    title="Same mean (m=0.3), increasing concentration κ",
    xaxis_title="x",
    yaxis_title="density",
    width=900,
    height=420,
)
fig

6) Derivations#

6.1 Expectation#

Starting from the definition:

\begin{align}
\mathbb{E}[X] &= \int_0^1 x\,f(x\mid\alpha,\beta)\,dx \\
&= \frac{1}{B(\alpha,\beta)}\int_0^1 x^{\alpha}(1-x)^{\beta-1}\,dx \\
&= \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)}.
\end{align}

Using \(B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}\) and the recurrence \(\Gamma(z+1)=z\,\Gamma(z)\):

\begin{align}
\frac{B(\alpha+1,\beta)}{B(\alpha,\beta)} &= \frac{\Gamma(\alpha+1)\Gamma(\beta)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+1)\,\Gamma(\alpha)\Gamma(\beta)} \\
&= \frac{\alpha\,\Gamma(\alpha)\,\Gamma(\alpha+\beta)}{(\alpha+\beta)\,\Gamma(\alpha+\beta)\,\Gamma(\alpha)} \\
&= \frac{\alpha}{\alpha+\beta}.
\end{align}

6.2 Variance#

Compute \(\mathbb{E}[X^2]\) similarly:

\begin{align}
\mathbb{E}[X^2] &= \frac{1}{B(\alpha,\beta)}\int_0^1 x^{\alpha+1}(1-x)^{\beta-1}\,dx \\
&= \frac{B(\alpha+2,\beta)}{B(\alpha,\beta)} = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}.
\end{align}

Then

\[\mathrm{Var}(X)=\mathbb{E}[X^2]-\mathbb{E}[X]^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.\]
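
Both ratio-of-beta-function identities are easy to verify numerically in log space; a quick sketch reusing \((\alpha_0,\beta_0)\) and the moments dict m from Section 4:

# Sketch: E[X] = B(α+1,β)/B(α,β) and E[X²] = B(α+2,β)/B(α,β), evaluated via betaln
ex1 = np.exp(special.betaln(alpha0 + 1, beta0) - special.betaln(alpha0, beta0))
ex2 = np.exp(special.betaln(alpha0 + 2, beta0) - special.betaln(alpha0, beta0))
ex1, m["mean"], ex2 - ex1**2, m["var"]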

6.3 Likelihood (iid sample)#

For data \(x_1,\ldots,x_n \in (0,1)\) iid from \(\mathrm{Beta}(\alpha,\beta)\):

\begin{align}
L(\alpha,\beta) &= \prod_{i=1}^n \frac{1}{B(\alpha,\beta)}\, x_i^{\alpha-1}(1-x_i)^{\beta-1}, \\
\ell(\alpha,\beta) &= \log L(\alpha,\beta) \\
&= -n\log B(\alpha,\beta) + (\alpha-1)\sum_{i=1}^n \log x_i + (\beta-1)\sum_{i=1}^n \log(1-x_i).
\end{align}

Maximizing \(\ell(\alpha,\beta)\) (MLE) has no closed-form solution in general; it’s typically solved numerically. SciPy’s beta.fit does this for you; a direct optimization cross-check follows the helper below.

def beta_loglikelihood(x: np.ndarray, alpha: float, beta: float) -> float:
    x = np.asarray(x, dtype=float)
    if np.any((x <= 0) | (x >= 1)):
        return -np.inf

    n = x.size
    return (
        -n * special.betaln(alpha, beta)
        + (alpha - 1) * np.sum(np.log(x))
        + (beta - 1) * np.sum(np.log1p(-x))
    )


# Example log-likelihood value
beta_loglikelihood(samples_scipy[:1000], alpha0, beta0)
493.07429490684376
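
As a cross-check on beta.fit, we can minimize the negative log-likelihood directly; a sketch using scipy.optimize (the starting point [1.0, 1.0] is an arbitrary choice):

from scipy import optimize

def neg_loglik(params: np.ndarray, x: np.ndarray) -> float:
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf  # keep the optimizer inside the valid parameter region
    return -beta_loglikelihood(x, a, b)

res = optimize.minimize(
    neg_loglik, x0=np.array([1.0, 1.0]), args=(samples_scipy[:1000],), method="Nelder-Mead"
)
res.x  # should land close to (alpha0, beta0) = (2, 5)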

7) Sampling & Simulation#

A numerically convenient way to sample from a beta distribution uses the Gamma ratio identity:

  1. Sample \(G_1 \sim \mathrm{Gamma}(\alpha, 1)\)

  2. Sample \(G_2 \sim \mathrm{Gamma}(\beta, 1)\)

  3. Return \(X = \dfrac{G_1}{G_1 + G_2}\).

So the core task is: sample Gamma(shape, scale=1) using only NumPy.

7.1 Gamma sampling: Marsaglia–Tsang#

For shape \(k \ge 1\), Marsaglia & Tsang (2000) propose an efficient acceptance–rejection method. For \(k < 1\), we can use the reduction:

\[\mathrm{Gamma}(k) \overset{d}{=} \mathrm{Gamma}(k+1)\,U^{1/k}, \quad U\sim\mathrm{Uniform}(0,1).\]

We implement this below.

def gamma_rvs_numpy(shape: float, size: int, rng: np.random.Generator) -> np.ndarray:
    '''Sample Gamma(shape, scale=1) using NumPy only (Marsaglia-Tsang).

    Parameters
    ----------
    shape:
        k > 0
    size:
        number of samples
    rng:
        NumPy Generator
    '''

    k = float(shape)
    if k <= 0:
        raise ValueError("shape must be > 0")

    # k < 1: boost to k+1 and apply power transform
    if k < 1:
        g = gamma_rvs_numpy(k + 1.0, size, rng)
        u = rng.random(size)
        return g * (u ** (1.0 / k))

    # k >= 1: Marsaglia-Tsang
    d = k - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)

    out = np.empty(size, dtype=float)
    filled = 0

    while filled < size:
        n = size - filled
        x = rng.standard_normal(n)
        v = (1.0 + c * x)
        v = v * v * v  # (1 + c x)^3
        u = rng.random(n)

        positive = v > 0

        # First (cheap) acceptance
        accept = positive & (u < 1.0 - 0.0331 * (x**4))

        # Second acceptance (log test) - compute log(v) only where v > 0 to avoid warnings
        log_v = np.zeros_like(v)
        log_v[positive] = np.log(v[positive])

        accept2 = positive & (~accept) & (
            np.log(u) < 0.5 * x * x + d * (1.0 - v + log_v)
        )

        accept = accept | accept2
        accepted = d * v[accept]

        take = min(accepted.size, n)
        out[filled : filled + take] = accepted[:take]
        filled += take

    return out


def beta_rvs_numpy(alpha: float, beta: float, size: int, rng: np.random.Generator) -> np.ndarray:
    '''Sample Beta(alpha, beta) using Gamma ratio with NumPy-only Gamma sampler.'''
    g1 = gamma_rvs_numpy(alpha, size, rng)
    g2 = gamma_rvs_numpy(beta, size, rng)
    return g1 / (g1 + g2)


# Monte Carlo validation against theory
n = 200_000
samples_numpy = beta_rvs_numpy(alpha0, beta0, n, rng)

np.mean(samples_numpy), np.var(samples_numpy), m["mean"], m["var"]
(0.2857694499926869,
 0.025682506385920862,
 0.2857142857142857,
 0.025510204081632654)
# Compare NumPy-only sampler to SciPy sampler (quick KS test)
ks = stats.ks_2samp(samples_numpy[:20_000], samples_scipy[:20_000])
ks
KstestResult(statistic=0.00824999999999998, pvalue=0.5014359645922872, statistic_location=0.2744678864589129, statistic_sign=-1)
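
An alternative sampler is inverse-transform sampling: push Uniform(0,1) draws through the beta quantile function. It is typically slower than the Gamma-ratio construction but conceptually simple; a minimal sketch:

# Sketch: inverse-transform sampling via the quantile function (PPF)
u = rng.random(20_000)
samples_ppf = stats.beta.ppf(u, alpha0, beta0)
stats.ks_2samp(samples_ppf, samples_numpy[:20_000])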

8) Visualization#

We’ll visualize:

  • the theoretical PDF and CDF

  • Monte Carlo samples from our NumPy-only sampler

# PDF + histogram (Monte Carlo)
x = np.linspace(1e-4, 1 - 1e-4, 800)

fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=samples_numpy,
        nbinsx=60,
        histnorm="probability density",
        name="Monte Carlo (NumPy-only)",
        opacity=0.55,
    )
)
fig.add_trace(
    go.Scatter(
        x=x,
        y=stats.beta(alpha0, beta0).pdf(x),
        mode="lines",
        name="True PDF (SciPy)",
        line=dict(width=3),
    )
)

fig.update_layout(
    title=f"Beta({alpha0}, {beta0}): histogram vs PDF",
    xaxis_title="x",
    yaxis_title="density",
    width=900,
    height=420,
)
fig
# CDF: theoretical vs empirical
x = np.linspace(0, 1, 600)

emp_x = np.sort(samples_numpy)
emp_cdf = np.arange(1, emp_x.size + 1) / emp_x.size

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=stats.beta(alpha0, beta0).cdf(x), mode="lines", name="True CDF"))
fig.add_trace(
    go.Scatter(
        x=emp_x[::200],
        y=emp_cdf[::200],
        mode="markers",
        name="Empirical CDF (subsampled)",
        marker=dict(size=4, opacity=0.6),
    )
)

fig.update_layout(
    title=f"Beta({alpha0}, {beta0}): theoretical CDF vs empirical CDF",
    xaxis_title="x",
    yaxis_title="CDF",
    width=900,
    height=420,
)
fig

9) SciPy Integration (scipy.stats.beta)#

SciPy parameterization:

stats.beta(a=alpha, b=beta, loc=0, scale=1)
  • a, b are the shape parameters \(\alpha\), \(\beta\).

  • loc and scale allow you to shift/scale the support from \([0,1]\) to \([\text{loc},\ \text{loc}+\text{scale}]\).

dist = stats.beta(alpha0, beta0)  # loc=0, scale=1 by default

x = np.linspace(0, 1, 6)

pdf = dist.pdf(x)
cdf = dist.cdf(x)
samples = dist.rvs(size=5, random_state=rng)

pdf, cdf, samples
(array([0.    , 2.4576, 1.5552, 0.4608, 0.0384, 0.    ]),
 array([0.    , 0.3446, 0.7667, 0.959 , 0.9984, 1.    ]),
 array([0.3445, 0.4805, 0.0498, 0.2403, 0.4281]))
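
To illustrate loc and scale, here is a quick sketch of a beta rescaled to live on \([2, 6]\):

# Sketch: Beta(α, β) shifted/scaled to [2, 6] via loc=2, scale=4
dist_shifted = stats.beta(alpha0, beta0, loc=2, scale=4)
dist_shifted.support(), dist_shifted.mean()  # support (2.0, 6.0); mean = 2 + 4·α/(α+β)
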
# Fitting (MLE) with SciPy
# If you KNOW the data live on [0, 1], it's common to fix loc=0 and scale=1.

a_hat, b_hat, loc_hat, scale_hat = stats.beta.fit(samples_numpy[:10_000], floc=0, fscale=1)
a_hat, b_hat, loc_hat, scale_hat
(1.9849242586340088, 4.951405202339398, 0, 1)

10) Statistical Use Cases#

10.1 Hypothesis testing / confidence intervals for proportions#

A classic exact confidence interval for a Binomial proportion (Clopper–Pearson) uses Beta quantiles.

If \(K \sim \mathrm{Binomial}(n, p)\) and you observe \(k\) successes, a two-sided \((1-\alpha)\) interval is:

  • lower endpoint: \(\mathrm{Beta}^{-1}(\alpha/2;\; k,\; n-k+1)\)

  • upper endpoint: \(\mathrm{Beta}^{-1}(1-\alpha/2;\; k+1,\; n-k)\)

(where \(\mathrm{Beta}^{-1}\) is the inverse CDF / quantile function).

10.2 Bayesian modeling (Beta–Bernoulli / Beta–Binomial)#

Let \(p\) be an unknown success probability.

  • Prior: \(p \sim \mathrm{Beta}(\alpha_0, \beta_0)\)

  • Data: \(k\) successes and \(n-k\) failures

Posterior is conjugate:

\[p \mid \text{data} \sim \mathrm{Beta}(\alpha_0 + k,\ \beta_0 + n - k).\]

This is often interpreted as adding pseudo-counts to observed counts.

10.3 Generative modeling#

  • Sample random probabilities \(p\) for Bernoulli events.

  • Sample mixing weights \(w\) for two-component mixtures.

  • Generalization: Dirichlet for \(K\)-way mixture weights.
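
A quick sketch of the Dirichlet generalization using NumPy's generator (the concentration vector [2.0, 3.0, 5.0] is an arbitrary example):

# Sketch: K-way random mixing weights; each row sums to 1.
# With two components, the first coordinate of Dirichlet([a, b]) is Beta(a, b).
w = rng.dirichlet([2.0, 3.0, 5.0], size=4)
w, w.sum(axis=1)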

# Example: Clopper–Pearson interval
n = 100
k = 37
alpha_level = 0.05

if k == 0:
    cp_low = 0.0
else:
    cp_low = stats.beta.ppf(alpha_level / 2, k, n - k + 1)

if k == n:
    cp_high = 1.0
else:
    cp_high = stats.beta.ppf(1 - alpha_level / 2, k + 1, n - k)

(cp_low, cp_high)
(0.2755665796145515, 0.47235164055168316)
# Example: Bayesian update for a Bernoulli probability
alpha_prior, beta_prior = 2.0, 2.0

alpha_post = alpha_prior + k
beta_post = beta_prior + (n - k)

prior = stats.beta(alpha_prior, beta_prior)
post = stats.beta(alpha_post, beta_post)

x = np.linspace(1e-4, 1 - 1e-4, 600)

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=prior.pdf(x), mode="lines", name=f"Prior Beta({alpha_prior:.0f},{beta_prior:.0f})"))
fig.add_trace(go.Scatter(x=x, y=post.pdf(x), mode="lines", name=f"Posterior Beta({alpha_post:.0f},{beta_post:.0f})", line=dict(width=3)))

fig.update_layout(
    title=f"Bayesian update (n={n}, k={k})",
    xaxis_title="p",
    yaxis_title="density",
    width=900,
    height=420,
)
fig
# Posterior probability of beating a threshold (Bayesian hypothesis-style query)
threshold = 0.4
post_prob = 1 - post.cdf(threshold)
post_mean = post.mean()
post_prob, post_mean
(0.29530027558732863, 0.375)

11) Pitfalls#

  • Invalid parameters: \(\alpha \le 0\) or \(\beta \le 0\) does not define a valid beta distribution.

  • Boundary data (0 or 1):

    • The continuous Beta model assigns probability 0 to exact endpoints.

    • Log-likelihood involves \(\log x\) and \(\log(1-x)\), which become \(-\infty\) at 0 or 1.

    • If your data contain many exact 0/1 values, consider a zero/one-inflated beta model or add a measurement model.

  • Numerical issues near 0 or 1:

    • For \(\alpha<1\) or \(\beta<1\), the density can diverge near the boundaries; use log-space (betaln, logpdf).

    • When plotting, avoid evaluating exactly at 0 or 1; use small epsilons.

  • Fitting:

    • stats.beta.fit will estimate loc and scale unless fixed.

    • If data are already in \([0,1]\), using floc=0, fscale=1 avoids spurious shifts/scales.
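
The following sketch contrasts a free fit with one that fixes the support, using the samples_numpy draws from Section 7:

# Sketch: a free fit also estimates loc/scale and can drift; fixing them is safer for [0,1] data
free_fit = stats.beta.fit(samples_numpy[:5_000])
fixed_fit = stats.beta.fit(samples_numpy[:5_000], floc=0, fscale=1)
free_fit, fixed_fit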

12) Summary#

  • The beta distribution is a flexible family on \([0,1]\), ideal for modeling uncertain probabilities.

  • \((\alpha,\beta)\) control shape; \(m=\alpha/(\alpha+\beta)\) is the mean and \(\kappa=\alpha+\beta\) acts like a concentration.

  • Key formulas (mean/variance/skewness/kurtosis) are closed-form; the CDF uses the regularized incomplete beta function.

  • Sampling is easy via the Gamma ratio identity; SciPy provides a robust reference implementation (scipy.stats.beta).

References

  • Marsaglia, G., & Tsang, W. W. (2000). A simple method for generating gamma variables. ACM Transactions on Mathematical Software, 26(3), 363–372.

  • SciPy documentation: scipy.stats.beta, scipy.special.betainc, scipy.special.hyp1f1.